Abstract

We consider the smoothing probabilities of a hidden Markov model (HMM). We show that under fairly general conditions on the HMM, exponential forgetting still holds, and the smoothing probabilities can be well approximated by those of the double-sided HMM. This makes it possible to use ergodic theorems. As an application we consider the pointwise maximum a posteriori segmentation, and show that the corresponding risks converge.
1. Introduction



Let $Y = \{Y_k\}_{k=-\infty}^{\infty}$ be a double-sided stationary MC with states $S = \{1,\ldots,K\}$ and irreducible aperiodic transition matrix $(P(i,j))$. Let $X = \{X_k\}_{k=-\infty}^{\infty}$ be the (double-sided) process such that: 1) given $\{Y_k\}$, the random variables $\{X_k\}$ are conditionally independent; 2) the distribution of $X_n$ depends on $\{Y_k\}$ only through $Y_n$. The process $X$ is sometimes called the hidden Markov process (HMP) and the pair $(Y, X)$ is referred to as the hidden Markov model (HMM). The name is motivated by the assumption that the process $Y$ (sometimes called the regime) is non-observable. The distributions $P_i := P(X_1 \in \cdot\,|Y_1 = i)$ are called emission distributions. We shall assume that the emission distributions are defined on a measurable space $(\mathcal{X}, \mathcal{B})$, where $\mathcal{X}$ is usually $\mathbb{R}^d$ and $\mathcal{B}$ is the Borel $\sigma$-algebra. Without loss of generality, we shall assume that the measures $P_i$ have densities $f_i$ with respect to some reference measure $\mu$. Since our study is mainly motivated by statistical learning, we would like to be consistent with the notation used there ($X$ for the observations and $Y$ for the latent variables), and therefore our notation differs from the one used in the HMM literature, where usually $X$ stands for the regime and $Y$ for the observations.

HMMs are widely used in various fields of application, including speech recognition (Rabiner, 1989; Jelinek, 2001), bioinformatics (Koski, 2001; Durbin et al., 1998), language processing, image analysis (Li et al., 2000) and many others. For a general overview of HMMs, we refer to the textbook (Cappé et al., 2005) and the overview paper (Ephraim and Merhav, 2002).

The central objects of the present paper are the smoothing probabilities $P(Y_t = s|X_z,\ldots,X_n)$, where $t, z, n \in \mathbb{Z}$ and $s \in S$. They are important tools for making inferences about the regime at time $t$. By Lévy's martingale convergence theorem, it immediately follows that as $n \to \infty$,

$$P(Y_t = s|X_z,\ldots,X_n) \to P(Y_t = s|X_z,\ldots) =: P(Y_t = s|X_z^{\infty}), \quad \text{a.s.} \tag{1}$$

Let $P(Y_t \in \cdot\,|X_z^{\infty})$ denote the $K$-dimensional vector of probabilities from the right side of (1). By the martingale convergence theorem again, as $z \to -\infty$,

$$P(Y_t = s|X_z^{\infty}) \to P(Y_t = s|\ldots,X_{-1},X_0,X_1,\ldots) =: P(Y_t = s|X_{-\infty}^{\infty}), \quad \text{a.s.}$$



Email address: jyril@ut.ee (Jüri Lember)
Estonian Science Foundation grant no. 7553
Preprint submitted to Statistics and Probability Letters, May 11, 2011

The double-sided smoothing process $\{P(Y_t \in \cdot\,|X_{-\infty}^{\infty})\}_{t=-\infty}^{\infty}$ is stationary and ergodic, hence the ergodic theorems hold for this process. To be able to use these ergodic theorems for establishing limit theorems in terms of the smoothing probabilities $P(Y_t = s|X_z,\ldots,X_n)$, it is necessary to approximate them with the double-sided smoothing process. This approach is, among others, used in (Bickel et al., 1998). In other words, we are interested in bounding the difference $\|P(Y_t \in \cdot\,|X_z^n) - P(Y_t \in \cdot\,|X_{-\infty}^{\infty})\|$, where $\|\cdot\|$ stands for the total variation distance. Our first main result, Corollary 2.1, states that under the so-called cluster assumption A, there exist a finite random variable $C_1$ and a constant $\rho_1 \in (0,1)$ such that for every $z_2, z_1, t, n$ with $z_2 < z_1 < 1 < t < n$,

$$\|P(Y_t \in \cdot\,|X_{z_1}^n) - P(Y_t \in \cdot\,|X_{z_2}^n)\| \le C_1\rho_1^{t-1}, \quad \text{a.s.}$$



Similar results can be found in the literature for the special cases where the transition matrix has all positive entries or the emission densities $f_i$ are all positive (Le Gland and Mevel, 2000; Gerencsér and Molnár-Sáska, 2002; Cappé et al., 2005). Both conditions are restrictive, and assumption A relaxes them.



We go one step further and consider the approximation of smoothing probabilities by two-sided limits. Our second main result, Theorem 2.1, states that under A, for every $z < 1 < t < k < n$,

$$\|P(Y_t \in \cdot\,|X_z^n) - P(Y_t \in \cdot\,|X_{-\infty}^{\infty})\| \le C_1\rho_o^{t-1} + C'_k\rho_o^{k-t}, \quad \text{a.s.},$$

where $\rho_o \in (0,1)$ is a fixed constant, $C_1$ is a finite random variable as in the previous bound and $\{C'_k\}$ is a finite ergodic process. Of course, without the ergodic property the existence of $\{C'_k\}$ would be trivial; however, as shown in the proof of Theorem 3.1, the ergodic property makes the bound useful in applications.

Condition A was introduced in (Lember and Koloydenko, 2010, 2008), where, under conditions slightly stronger than A, the existence of an infinite Viterbi alignment was shown. The technique used in these papers differs heavily from the one in the present paper, yet the same assumption appears. This indicates that A is indeed essential for HMMs.

Our motivation for studying limit theorems of smoothing processes comes from segmentation theory. Generally speaking, the segmentation problem consists of estimating the unobserved realization of the underlying Markov chain $Y_1,\ldots,Y_n$ given the $n$ observations from the HMP $X_1,\ldots,X_n$: $x^n := x_1,\ldots,x_n$. Formally, we are looking for a mapping $g: \mathcal{X}^n \to S^n$, called a classifier, that maps every sequence of observations into a state sequence; see (Koloydenko and Lember, 2010; Kuljus and Lember, 2010) for details. For finding the best $g$, it is natural to associate to every state sequence $s^n \in S^n$ a measure of goodness $R(s^n|x^n)$, referred to as the risk of $s^n$. The solution of the segmentation problem is then the state sequence with minimal risk. In the framework of pattern recognition theory, the risk is specified via a loss function $L: S^n \times S^n \to [0,\infty]$, where $L(a^n, b^n)$ measures the loss when the actual state sequence is $a^n$ and the prognosis is $b^n$. For any state sequence $s^n \in S^n$, the risk is then

$$R(s^n|x^n) := E[L(Y^n, s^n)|X^n = x^n] = \sum_{a^n \in S^n} L(a^n, s^n)P(Y^n = a^n|X^n = x^n). \tag{2}$$



In this paper, we consider the case when the loss function is given as

$$L(a^n, b^n) = \frac{1}{n}\sum_{i=1}^{n} l(a_i, b_i), \tag{3}$$

where $l(a_i, b_i) \ge 0$ is the loss of classifying the $i$-th symbol $a_i$ as $b_i$. Typically, for every state $s$, $l(s,s) = 0$. Most frequently, $l(s,s') = I_{\{s \ne s'\}}$, and then the risk $R(s^n|x^n)$ just counts the expected number of misclassified symbols (normalized by $n$). Given a classifier $g$, the quantity $R(g, x^n) := R(g(x^n)|x^n)$ measures its goodness when applied to the observations $x^n$. When $g$ is optimal in the sense of risk, then $R(g, x^n) = \min_{s^n} R(s^n|x^n) =: R(x^n)$. We are interested in the random variables $R(g, X^n)$. When $g$ is the maximum likelihood classifier (the so-called Viterbi alignment) and the HMM satisfies A, then (under an additional mild assumption) it can be shown that there exists a constant $R_v$ such that $R(g, X^n) \to R_v$ a.s. (Caliebe, 2006; Lember and Koloydenko, 2010). In this paper, we show that under A a similar result holds for the optimal alignment: there exists a constant $R$ such that $R(X^n) \to R$ a.s. These numbers (clearly $R_v \ge R$) depend only on the model, and they measure the asymptotic goodness of the segmentation. If $l(s,s') = I_{\{s \ne s'\}}$, then $R_v$ and $R$ are the asymptotic symbol-by-symbol misclassification rates when the Viterbi alignment or the best alignment (in the given sense), respectively, is used in segmentation.
2. Approximation of the smoothing probabilities

2.1. Preliminaries

Throughout the paper, let $x_u^v$, where $u, v \in \mathbb{Z}$, $u < v$, be a realization of $X_u,\ldots,X_v$. We refer to $x_u^v$ as the observations. When $u = 1$, it is omitted from the notation, i.e. $x^v := x_1^v$. Let $p(x_u^v)$ stand for the likelihood of the observations $x_u^v$. For every $u \le t \le v$ and $s \in S$, we also define the forward and backward variables $\alpha(x_u^t, s)$ and $\beta(x_{t+1}^v|s)$ as follows:

$$\alpha(x_u^t, s) := p(x_u^t|Y_t = s)P(Y_t = s), \qquad \beta(x_{t+1}^v|s) := \begin{cases} 1, & \text{if } t = v; \\ p(x_{t+1}^v|Y_t = s), & \text{if } t < v. \end{cases}$$

Here $p(x_u^t|Y_t = s)$ and $p(x_{t+1}^v|Y_t = s)$ are conditional densities (see also Ephraim and Merhav, 2002). The backward variables can be calculated recursively (backward recursion):

$$\beta(x_{t+1}^n|s) = \sum_{i \in S} P(s,i)f_i(x_{t+1})\beta(x_{t+2}^n|i).$$
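The backward recursion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: time is indexed from 0, `dens[t][i]` stands for $f_i(x_t)$, and the two-state model with binary emissions (`P`, `emission`, `x`) is a hypothetical example, not one taken from the paper.

```python
def backward_variables(P, dens):
    """Return beta[t][s] for t = 0..n-1, where dens[t][i] plays the role of f_i(x_t).

    beta[n-1][s] = 1 and
    beta[t][s] = sum_i P[s][i] * dens[t+1][i] * beta[t+1][i]  (backward recursion).
    """
    n, K = len(dens), len(P)
    beta = [[1.0] * K for _ in range(n)]
    for t in range(n - 2, -1, -1):
        for s in range(K):
            beta[t][s] = sum(P[s][i] * dens[t + 1][i] * beta[t + 1][i] for i in range(K))
    return beta


# Hypothetical two-state chain with emissions on {0, 1}; emission[i][x] = f_i(x).
P = [[0.9, 0.1], [0.2, 0.8]]
emission = [[0.7, 0.3], [0.1, 0.9]]
x = [0, 1, 1]
dens = [[emission[i][xt] for i in range(2)] for xt in x]
beta = backward_variables(P, dens)
```

The unscaled recursion is kept for readability; for long sequences the values underflow, and a normalized variant would be needed.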

For every $t \in \mathbb{Z}$, we shall denote by $\pi_t[x_u^v]$ the $K$-dimensional vector of conditional probabilities $P(Y_t \in \cdot\,|X_u^v = x_u^v)$. Our first goal is to bound the difference $\pi_t[x_{z_1}^n] - \pi_t[x_{z_2}^n]$, where $z_2 < z_1 < 1 < t < n$. For that, we shall follow the approach in (Cappé et al., 2005). It is based on the observation that, given the observations $x_z^n$, the underlying chain $Y_z,\ldots,Y_n$ is a conditionally inhomogeneous MC, i.e. for every $z \le k < n$ and $j \in S$,

$$P(Y_{k+1} = j|Y_z^k = y_z^k, X_z^n = x_z^n) = P(Y_{k+1} = j|Y_k = y_k, X_z^n = x_z^n) =: F_k(y_k, j),$$

where for every $i \in S$, $F_k(i,j)$ is called the forward smoothing probability (Cappé et al., 2005, Prop. 3.3.2; see also Ephraim and Merhav, 2002, (5.2.1)). It is known (Cappé et al., 2005; Ephraim and Merhav, 2002, (3.30)) that if $\beta(x_{k+1}^n|i) > 0$, then

$$F_k(i,j) = \frac{P(i,j)f_j(x_{k+1})\beta(x_{k+2}^n|j)}{\beta(x_{k+1}^n|i)}. \tag{4}$$



When $\beta(x_{k+1}^n|i) = 0$, we define $F_k(i,j) = 0$ $\forall j \in S$. Note that the matrix $F_k$ depends on the observations $x_{k+1}^n$ only. This dependence is sometimes denoted by $F_k[x_{k+1}^n]$. With the matrices $F_k$, for every $t$ such that $z \le t \le n$, it holds (e.g. Cappé et al., 2005, (4.30)) that

$$\pi'_t[x_z^n] = \pi'_z[x_z^n]\Big(\prod_{i=z}^{t-1} F_i[x_{i+1}^n]\Big), \tag{5}$$

where $'$ stands for transposition. For $n > t > 1 > z_1 > z_2$, thus

$$(\pi_t[x_{z_1}^n] - \pi_t[x_{z_2}^n])' = (\pi_1[x_{z_1}^n] - \pi_1[x_{z_2}^n])'\prod_{i=1}^{t-1} F_i[x_{i+1}^n]. \tag{6}$$
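Relations (4) and (5) can be checked numerically on a small example. The sketch below is an illustration only: it uses 0-based time, an unscaled forward-backward pass, and a hypothetical two-state model (`P`, `init`, `emission`, `x` are assumptions, not taken from the paper). It builds the matrices $F_k$ from the backward variables and verifies that propagating $\pi_t$ through $F_t$ reproduces $\pi_{t+1}$.

```python
def backward(P, dens):
    """Backward variables beta[t][s]; dens[t][i] plays the role of f_i(x_t)."""
    n, K = len(dens), len(P)
    beta = [[1.0] * K for _ in range(n)]
    for t in range(n - 2, -1, -1):
        for s in range(K):
            beta[t][s] = sum(P[s][i] * dens[t + 1][i] * beta[t + 1][i] for i in range(K))
    return beta


def forward_smoothing_matrices(P, dens):
    """F[k][i][j] = P(i,j) f_j(x_{k+1}) beta_{k+1}(j) / beta_k(i); zero row if beta_k(i) = 0."""
    n, K = len(dens), len(P)
    beta = backward(P, dens)
    return [[[P[i][j] * dens[k + 1][j] * beta[k + 1][j] / beta[k][i] if beta[k][i] > 0 else 0.0
              for j in range(K)] for i in range(K)] for k in range(n - 1)]


def smoothing(P, init, dens):
    """Smoothing probabilities pi[t][s] = P(Y_t = s | all observations)."""
    n, K = len(dens), len(P)
    beta = backward(P, dens)
    alpha = [[init[s] * dens[0][s] for s in range(K)]]
    for t in range(1, n):
        alpha.append([dens[t][s] * sum(alpha[t - 1][i] * P[i][s] for i in range(K))
                      for s in range(K)])
    likelihood = sum(alpha[-1])
    return [[alpha[t][s] * beta[t][s] / likelihood for s in range(K)] for t in range(n)]


# Hypothetical model: stationary initial law (2/3, 1/3) for this P.
P = [[0.9, 0.1], [0.2, 0.8]]
init = [2 / 3, 1 / 3]
emission = [[0.7, 0.3], [0.1, 0.9]]
x = [0, 1, 1, 0]
dens = [[emission[i][xt] for i in range(2)] for xt in x]

F = forward_smoothing_matrices(P, dens)
pi = smoothing(P, init, dens)
# One step of (5): pi_{t+1}' = pi_t' F_t.
propagated = [[sum(pi[t][i] * F[t][i][j] for i in range(2)) for j in range(2)]
              for t in range(len(x) - 1)]
max_gap = max(abs(propagated[t][j] - pi[t + 1][j])
              for t in range(len(x) - 1) for j in range(2))
```

In this example every row of each `F[k]` sums to one, so the matrices are genuine transition matrices of the conditional chain.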



Let $\pi_1$ and $\pi_2$ be two probability measures on $S$. If $A$ is a transition matrix on $S$, then $A'\pi_i$ ($i = 1,2$) is a vector that corresponds to a probability measure. We are interested in the total variation distance between the measures $A'\pi_1$ and $A'\pi_2$. The approach in this paper uses the fact that the difference between measures can be bounded as follows (Cappé et al., 2005, Cor. 4.3.9):

$$\|A'\pi_1 - A'\pi_2\| = \|A'(\pi_1 - \pi_2)\| \le \|\pi_1 - \pi_2\|\,\delta(A), \tag{7}$$

where $\delta(A)$ is the Dobrushin coefficient of $A$, defined as

$$\delta(A) := \frac{1}{2}\sup_{i,j \in S}\|A(i,\cdot) - A(j,\cdot)\|.$$

Here, $A(i,\cdot)$ is the $i$-th row of the matrix. Hence, applying (7) to (6), we get (Cappé et al., 2005, Prop. 4.3.20)

$$\|\pi_t[x_{z_1}^n] - \pi_t[x_{z_2}^n]\| \le \|\pi_1[x_{z_1}^n] - \pi_1[x_{z_2}^n]\|\,\delta\Big(\prod_{i=1}^{t-1} F_i[x_{i+1}^n]\Big) \le 2\,\delta\Big(\prod_{i=1}^{t-1} F_i[x_{i+1}^n]\Big). \tag{8}$$



Another useful fact is that for two transition matrices $A$, $B$, it holds (see, e.g., Cappé et al., 2005, Prop. 4.3.10) that $\delta(AB) \le \delta(A)\delta(B)$; hence the right-hand side of (8) can be further bounded above by $2\prod_{i=1}^{t-1}\delta(F_i[x_{i+1}^n])$. The Dobrushin coefficient of $A$ can be bounded above using the so-called Doeblin condition: if there exist $\epsilon > 0$ and a probability measure $\lambda = (\lambda_1,\ldots,\lambda_K)$ on $S$ such that

$$A(i,j) \ge \epsilon\lambda_j, \quad \forall i,j \in S, \tag{9}$$

then $\delta(A) \le 1 - \epsilon$ (Cappé et al., 2005, Lemma 4.3.13). This condition holds, for example, if all entries of $A$ are positive. If the matrices $F_i$ satisfy the Doeblin condition with a common $\epsilon$, then the right-hand side converges to zero exponentially in $t$.
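The following sketch illustrates these facts numerically. It assumes that $\|\cdot\|$ is the $\ell_1$ norm on vectors (so that two probability vectors are at distance at most 2), and it picks the Doeblin pair $(\epsilon, \lambda)$ from column minima, which is one convenient choice rather than the only one. The matrices `A` and `B` are hypothetical.

```python
def dobrushin(A):
    """delta(A) = 1/2 * max over row pairs of the l1 distance between rows."""
    m = len(A)
    return 0.5 * max(sum(abs(A[i][k] - A[j][k]) for k in range(len(A[0])))
                     for i in range(m) for j in range(m))


def doeblin_pair(A):
    """Return (eps, lam) with A[i][j] >= eps * lam[j], built from column minima.

    eps is the sum of column minima and lam the normalized minima;
    if eps == 0 this particular construction yields no usable bound.
    """
    mins = [min(row[j] for row in A) for j in range(len(A[0]))]
    eps = sum(mins)
    lam = [v / eps for v in mins] if eps > 0 else None
    return eps, lam


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


# Hypothetical transition matrices.
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.5, 0.5], [0.3, 0.7]]

delta_A, delta_B = dobrushin(A), dobrushin(B)
delta_AB = dobrushin(matmul(A, B))
eps, lam = doeblin_pair(A)
```

Here `delta_AB` does not exceed `delta_A * delta_B`, and `delta_A` does not exceed `1 - eps`, in line with the two bounds quoted above.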
2.2. Cluster-assumption 

Recall that $f_i$ are the densities of $P(X_1 \in \cdot\,|Y_1 = i)$ with respect to some reference measure $\mu$ on $(\mathcal{X}, \mathcal{B})$. For each $i \in S$, let $G_i := \{x \in \mathcal{X} : f_i(x) > 0\}$. We call a subset $C \subset S$ a cluster if the following conditions are satisfied:

$$\min_{j \in C} P_j(\cap_{i \in C} G_i) > 0, \quad \text{and} \quad \max_{j \notin C} P_j(\cap_{i \in C} G_i) = 0.$$

Hence, a cluster is a maximal subset of states such that $G_C = \cap_{i \in C} G_i$, the intersection of the supports of the corresponding emission distributions, is 'detectable'. Distinct clusters need not be disjoint, and a cluster can consist of a single state. In this latter case such a state is not hidden, since it is exposed by any observation it emits. When $K = 2$, then $S$ is the only cluster possible, since otherwise the underlying Markov chain would cease to be hidden.
Let $C$ be a cluster. The existence of $C$ implies the existence of a set $\mathcal{X}_o \subset \cap_{i \in C} G_i$ and constants $\epsilon > 0$, $K < \infty$ such that $\mu(\mathcal{X}_o) > 0$ and, $\forall x \in \mathcal{X}_o$, the following statements hold: (i) $\epsilon < \min_{i \in C} f_i(x)$; (ii) $\max_{i \in C} f_i(x) < K$; (iii) $\max_{j \notin C} f_j(x) = 0$. For the proof, see (Lember and Koloydenko, 2010).



Assumption A: There exists a cluster $C \subset S$ such that the sub-stochastic matrix $R = (P(i,j))_{i,j \in C}$ is primitive (i.e. there is a positive integer $r$ such that the $r$-th power $R^r$ is strictly positive).
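For a given cluster, primitivity of $R$ can be checked mechanically by raising the zero pattern of $R$ to successive powers. The sketch below is an illustration with hypothetical matrices; the function name `primitivity_index` is ours. It stops at the exponent $(m-1)^2 + 1$ for an $m \times m$ matrix, which is Wielandt's classical bound: if no power up to that exponent is strictly positive, the matrix is not primitive.

```python
def primitivity_index(R):
    """Smallest r with R^r strictly positive, or None if R is not primitive.

    Only the zero pattern of R matters, so boolean matrix powers are used.
    The search stops at (m-1)^2 + 1 (Wielandt's bound for m x m matrices).
    """
    m = len(R)
    pattern = [[R[i][j] > 0 for j in range(m)] for i in range(m)]
    power = pattern
    for r in range(1, (m - 1) ** 2 + 2):
        if all(all(row) for row in power):
            return r
        power = [[any(power[i][k] and pattern[k][j] for k in range(m)) for j in range(m)]
                 for i in range(m)]
    return None


# Hypothetical restriction of P to a two-state cluster (rows need not sum to one).
R_cluster = [[0.0, 0.5], [0.5, 0.3]]
r_cluster = primitivity_index(R_cluster)

# A purely cyclic block is irreducible but periodic, hence not primitive.
r_cyclic = primitivity_index([[0.0, 1.0], [1.0, 0.0]])
```

In the first example $R$ has a zero entry, yet $R^2$ is strictly positive, so the assumption can hold even when $P$ itself is not strictly positive.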

Clearly, assumption A is satisfied if the matrix $P$ has all positive elements. Since any irreducible aperiodic matrix is primitive, assumption A is also satisfied if the densities $f_i$ satisfy the following condition: for every $x \in \mathcal{X}$, $\min_{i \in S} f_i(x) > 0$, i.e. for all $i \in S$, $G_i = \mathcal{X}$. Thus A is more general than the strong mixing condition (Cappé et al., 2005, Assumption 4.3.21) and (Cappé et al., 2005, Assumption 4.3.29). For a more general discussion of A, see (Lember and Koloydenko, 2008, 2010).



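In the following, we assume A. Let $C$ be the corresponding cluster, and let $\mathcal{X}_o$ be the corresponding set.

Proposition 2.1. Let $x_{k+1}^n$ be such that $p(x_{k+1}^n) > 0$ and $x_{k+1}^{k+r} \in \mathcal{X}_o^r$. Then

$$\delta\Big(\prod_{i=0}^{r-1} F_{k+i}[x_{k+i+1}^n]\Big) \le 1 - \frac{\min_{i,j} R^r(i,j)}{\max_{i,j} R^r(i,j)}\Big(\frac{\epsilon}{K}\Big)^r =: \rho < 1. \tag{10}$$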

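Proof. Let $A := \prod_{i=0}^{r-1} F_{k+i}$. Using the backward recursion, it follows that for every $r \ge 1$

$$A(i,j) = \frac{\sum_{i_1}\cdots\sum_{i_{r-1}} P(i,i_1)f_{i_1}(x_{k+1})\cdots f_{i_{r-1}}(x_{k+r-1})P(i_{r-1},j)f_j(x_{k+r})\beta(x_{k+r+1}^n|j)}{\sum_{j'}\sum_{i_1}\cdots\sum_{i_{r-1}} P(i,i_1)f_{i_1}(x_{k+1})\cdots f_{i_{r-1}}(x_{k+r-1})P(i_{r-1},j')f_{j'}(x_{k+r})\beta(x_{k+r+1}^n|j')}.$$

Since $x_{k+1}^{k+r} \in \mathcal{X}_o^r$, by (iii) all terms with an index outside $C$ vanish, and then by (ii) and (i), for every $i, j \in S$,

$$A(i,j) = \frac{\sum_{i_1 \in C}\cdots\sum_{i_{r-1} \in C} P(i,i_1)f_{i_1}(x_{k+1})\cdots P(i_{r-1},j)f_j(x_{k+r})\beta(x_{k+r+1}^n|j)}{\sum_{j' \in C}\sum_{i_1 \in C}\cdots\sum_{i_{r-1} \in C} P(i,i_1)f_{i_1}(x_{k+1})\cdots P(i_{r-1},j')f_{j'}(x_{k+r})\beta(x_{k+r+1}^n|j')}$$

$$\ge \Big(\frac{\epsilon}{K}\Big)^r \frac{\big(\sum_{i_1 \in C}\cdots\sum_{i_{r-1} \in C} P(i,i_1)\cdots P(i_{r-1},j)\big)\beta(x_{k+r+1}^n|j)}{\sum_{j' \in C}\big(\sum_{i_1 \in C}\cdots\sum_{i_{r-1} \in C} P(i,i_1)\cdots P(i_{r-1},j')\big)\beta(x_{k+r+1}^n|j')}$$

$$= \Big(\frac{\epsilon}{K}\Big)^r \frac{R^r(i,j)\beta(x_{k+r+1}^n|j)}{\sum_{j'} R^r(i,j')\beta(x_{k+r+1}^n|j')} \ge \frac{\min_{i,j} R^r(i,j)}{\max_{i,j} R^r(i,j)}\Big(\frac{\epsilon}{K}\Big)^r \frac{\beta(x_{k+r+1}^n|j)}{\sum_{j'}\beta(x_{k+r+1}^n|j')} = \frac{\min_{i,j} R^r(i,j)}{\max_{i,j} R^r(i,j)}\Big(\frac{\epsilon}{K}\Big)^r \lambda_j,$$

where

$$\lambda_j := \frac{\beta(x_{k+r+1}^n|j)}{\sum_{j'}\beta(x_{k+r+1}^n|j')}.$$

Since

$$p(x_{k+1}^n) = \sum_{j}\alpha(x_{k+1}^{k+r}, j)\beta(x_{k+r+1}^n|j) > 0,$$

there must be a $j \in S$ such that $\beta(x_{k+r+1}^n|j) > 0$. So $(\lambda_j)_{j \in S}$ is a probability measure and the Doeblin condition holds. □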

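Lemma 2.1. Let $x_{z_2}^n$ be a sequence of observations with positive likelihood, i.e. $p(x_{z_2}^n) > 0$. Then, for every $t$ such that $z_2 < z_1 < 1 < t < n$,

$$\|\pi_t[x_{z_1}^n] - \pi_t[x_{z_2}^n]\| \le 2\rho^{\kappa(x_1^t)}, \tag{11}$$

where $\rho \in (0,1)$ is as in (10) and

$$\kappa(x_1^t) := \sum_{u=0}^{j-1} I_{\mathcal{X}_o^r}\big(x_{ur+2}^{(u+1)r+1}\big),$$

with $j = j(t)$ the number of complete blocks of length $r$ in $\{1,\ldots,t-1\}$, i.e. the largest integer such that $jr \le t-1$.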

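Proof. Recall that for two transition matrices $A$, $B$, $\delta(AB) \le \delta(A)\delta(B)$, so

$$\delta\Big(\prod_{i=1}^{t-1} F_i[x_{i+1}^n]\Big) = \delta\Big(\Big(\prod_{u=0}^{j-1}\prod_{i=ur+1}^{(u+1)r} F_i[x_{i+1}^n]\Big)\prod_{i=jr+1}^{t-1} F_i[x_{i+1}^n]\Big) \le \prod_{u=0}^{j-1}\delta\Big(\prod_{i=ur+1}^{(u+1)r} F_i[x_{i+1}^n]\Big) = \prod_{u=0}^{j-1}\delta(A_u),$$

where $A_u := \prod_{i=ur+1}^{(u+1)r} F_i[x_{i+1}^n]$. From Proposition 2.1 with $k = ur+1$,

$$\delta(A_u) \le \begin{cases} \rho, & \text{if } x_{(ur+1)+1}^{(ur+1)+r} \in \mathcal{X}_o^r; \\ 1, & \text{else.} \end{cases}$$

From (8), it holds that

$$\|\pi_t[x_{z_1}^n] - \pi_t[x_{z_2}^n]\| \le 2\,\delta\Big(\prod_{i=1}^{t-1} F_i[x_{i+1}^n]\Big) \le 2\prod_{u=0}^{j-1}\delta(A_u) \le 2\rho^{\kappa(x_1^t)}.$$

□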



Let $s_1 \in C$. By irreducibility and the cluster assumption, there is a path $s_1,\ldots,s_{r+1}$ such that $s_i \in C$ and $P(Y_1 = s_1,\ldots,Y_{r+1} = s_{r+1}) > 0$. By (i), for any $s_2,\ldots,s_{r+1} \in C$, it holds that $P(X_2^{r+1} \in \mathcal{X}_o^r|Y_2 = s_2,\ldots,Y_{r+1} = s_{r+1}) > 0$, implying that $P(X_2^{r+1} \in \mathcal{X}_o^r) > 0$. By stationarity of $X$, for every $k \ge 0$, it holds that $P(X_{k+1}^{k+r} \in \mathcal{X}_o^r) = P(X_2^{r+1} \in \mathcal{X}_o^r) =: p_r > 0$. The process $\{X_n\}_{n \ge 1}$ is ergodic, so

$$\lim_{t \to \infty}\frac{\kappa(X_1^t)}{t} = \lim_{t \to \infty}\frac{1}{r}\,\frac{\kappa(X_1^t)}{j(t)} = \frac{p_r}{r} > 0, \quad \text{a.s.} \tag{12}$$

Corollary 2.1. Assume A. Then there exist a non-negative finite random variable $C_1$ and a constant $\rho_1 \in (0,1)$ such that for every $z_2, z_1, t, n$ with $z_2 < z_1 < 1 < t < n$,

$$\|P(Y_t \in \cdot\,|X_{z_1}^n) - P(Y_t \in \cdot\,|X_{z_2}^n)\| \le C_1\rho_1^{t-1}, \quad \text{a.s.} \tag{13}$$

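Proof. The right-hand side of (11) does not depend on $n$ (as soon as $n$ is bigger than $t$), hence from Lemma 2.1

$$\sup_{n > t}\|P(Y_t \in \cdot\,|X_{z_1}^n) - P(Y_t \in \cdot\,|X_{z_2}^n)\| \le 2\rho^{\kappa(X_1^t)}, \quad \text{a.s.}$$

Thus, if $t \to \infty$, then by (12) it holds that

$$\limsup_{t}\frac{1}{t}\log\Big(\sup_{n > t}\|P(Y_t \in \cdot\,|X_{z_1}^n) - P(Y_t \in \cdot\,|X_{z_2}^n)\|\Big) \le \frac{p_r}{r}\log\rho, \quad \text{a.s.} \tag{14}$$

Let $\varrho$ be such that $\frac{p_r}{r}\log\rho < \varrho < 0$. Let

$$T(\omega) := \max\Big\{t \ge 1 : \frac{\log 2 + \kappa(X_1^t)\log\rho}{t} > \varrho\Big\}. \tag{15}$$

From (12), it follows that for almost every $\omega$, $T(\omega) < \infty$ and, for $t > T(\omega)$,

$$\log\Big(\sup_{n > t}\|P(Y_t \in \cdot\,|X_{z_1}^n)(\omega) - P(Y_t \in \cdot\,|X_{z_2}^n)(\omega)\|\Big) \le t\varrho$$

and hence, for $t > T(\omega)$ and $n > t$,

$$\|P(Y_t \in \cdot\,|X_{z_1}^n) - P(Y_t \in \cdot\,|X_{z_2}^n)\| \le e^{\varrho t} = (\rho_1)^t,$$

where $\rho_1 := e^{\varrho}$. The inequality (13) holds with $C_1 := 2\rho_1^{-T+1}$. □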
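The last step of the proof can be spelled out as a short case check. This is a sketch under the assumption that the norm is the $\ell_1$ norm, so that the distance between any two probability vectors is at most 2, and with $D_t$ introduced here only as shorthand for the left-hand side of (13).

```latex
% Notation (ours): D_t := \| P(Y_t \in \cdot | X_{z_1}^n) - P(Y_t \in \cdot | X_{z_2}^n) \|,
% and C_1 \rho_1^{t-1} = 2 \rho_1^{-T+1} \rho_1^{t-1} = 2 \rho_1^{t-T}.

% Case t \le T: the trivial bound suffices, because \rho_1^{t-T} \ge 1.
D_t \le 2 \le 2\rho_1^{t-T} = C_1 \rho_1^{t-1}.

% Case t > T: use the bound D_t \le \rho_1^{t} and \rho_1^{T} \le 1 \le 2.
D_t \le \rho_1^{t} = \rho_1^{T}\,\rho_1^{t-T} \le 2\rho_1^{t-T} = C_1 \rho_1^{t-1}.
```

Since $T$ is a.s. finite, $C_1$ is a.s. finite as well, although it need not be bounded.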

The inequality (14) is similar to Theorem 2.2 in (Le Gland and Mevel, 2000). The forgetting inequality in the form (13) is used in (Gerencsér and Molnár-Sáska, 2002).

Corollary 2.2. There exist a constant $\rho_1 \in (0,1)$ and an ergodic process $\{C_z\}_{z=-\infty}^{\infty}$ so that for any $z_2 < z_1 < z < t < n$,

$$\|P(Y_t \in \cdot\,|X_{z_1}^n) - P(Y_t \in \cdot\,|X_{z_2}^n)\| \le C_z\rho_1^{t-z}, \quad \text{a.s.} \tag{16}$$

Proof. The existence of $C_z$ follows exactly as in the case $z = 1$. The ergodicity of $\{C_z\}$ follows from the fact that the random variables $\{C_z\}$ are a stationary coding of the ergodic process $X$. □

Theorem 2.1. Assume A. Then there exist a constant $\rho_o \in (0,1)$ and a finite ergodic process $\{C'_k\}_{k=-\infty}^{\infty}$ so that for every $z < 1 < t < k < n$,

$$\|P(Y_t \in \cdot\,|X_z^n) - P(Y_t \in \cdot\,|X_{-\infty}^{\infty})\| \le C_1\rho_o^{t-1} + C'_k\rho_o^{k-t}, \quad \text{a.s.}, \tag{17}$$

where $C_1$ is a finite random variable as in Corollary 2.1.

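Proof. We reverse time by defining $Y'_k := Y_{-k}$, $X'_k := X_{-k}$. Thus, $P(Y'_{-t} \in \cdot\,|X'^{-z}_{-n}) = P(Y_t \in \cdot\,|X_z^n)$. It is easy to see that when the HMM $(Y, X)$ satisfies A, then so does the reversed HMM $(Y', X')$. From Corollary 2.2 it follows that there exist $\rho_2 \in (0,1)$ and an ergodic process $\{C''_z\}$ so that for any $-n_2 < -n_1 < -k < -t < -1 < -z$,

$$\|P(Y_t \in \cdot\,|X_z^{n_1}) - P(Y_t \in \cdot\,|X_z^{n_2})\| = \|P(Y'_{-t} \in \cdot\,|X'^{-z}_{-n_1}) - P(Y'_{-t} \in \cdot\,|X'^{-z}_{-n_2})\| \le C''_{-k}\rho_2^{-t-(-k)} = C''_{-k}\rho_2^{k-t} = C'_k\rho_2^{k-t}, \quad \text{a.s.},$$

where $C'_z := C''_{-z}$. The right side does not depend on $n_1$, $n_2$ and $z$. Hence, letting $z \to -\infty$ and using Lévy's martingale convergence theorem, for every $1 < t < k < n_1 < n_2$,

$$\|P(Y_t \in \cdot\,|X_{-\infty}^{n_1}) - P(Y_t \in \cdot\,|X_{-\infty}^{n_2})\| \le C'_k\rho_2^{k-t}, \quad \text{a.s.}$$

Letting now $n_2 \to \infty$, writing $n$ for $n_1$ and using Lévy's martingale convergence theorem again, for every $1 < t < k < n$,

$$\|P(Y_t \in \cdot\,|X_{-\infty}^{\infty}) - P(Y_t \in \cdot\,|X_{-\infty}^{n})\| \le C'_k\rho_2^{k-t}, \quad \text{a.s.} \tag{18}$$

Applying the same theorem to (13), with $z_2 \to -\infty$ and $z = z_1$, we get that for every $z < 1 < t < n$,

$$\|P(Y_t \in \cdot\,|X_{-\infty}^{n}) - P(Y_t \in \cdot\,|X_z^n)\| \le C_1\rho_1^{t-1}, \quad \text{a.s.} \tag{19}$$

Hence, with $\rho_o = \max\{\rho_1, \rho_2\}$, the inequality (17) follows from the inequalities (18) and (19) by the triangle inequality. □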
3. Convergence of risks

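Recall that $l: S \times S \to [0,\infty)$ is the pointwise loss. Let, for any $1 \le t \le n$ and $s \in S$,

$$R_t(s|x^n) := E[l(Y_t, s)|X^n = x^n] = \sum_{a \in S} l(a,s)P(Y_t = a|X^n = x^n).$$

Thus, $R_t(s|x^n)$ is the conditional risk of classifying $Y_t = s$ given the observations $x^n$. The risk of the whole state sequence $s^n$ as defined in (2), with $L$ as in (3), is easily seen to be

$$R(s^n|x^n) = \frac{1}{n}\sum_{t=1}^{n} R_t(s_t|x^n).$$

Let, for every $t \in \mathbb{Z}$ and $s \in S$,

$$R_t(s|X_{-\infty}^{\infty}) := E[l(Y_t, s)|X_{-\infty}^{\infty}] = \sum_{a \in S} l(a,s)P(Y_t = a|X_{-\infty}^{\infty}).$$

For $t \ge 1$, thus

$$|R_t(s|X_{-\infty}^{\infty}) - R_t(s|X^n)| \le l(s)\|P(Y_t \in \cdot\,|X^n) - P(Y_t \in \cdot\,|X_{-\infty}^{\infty})\|, \tag{20}$$

where $l(s) = \max_a l(a,s)$. Finally, recall that $R(x^n) := \min_{s^n} R(s^n|x^n)$. Since the risk is a sum of pointwise risks, the minimum is attained coordinatewise, i.e. $R(x^n) = \frac{1}{n}\sum_{t=1}^{n}\min_{s \in S} R_t(s|x^n)$.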
Theorem 3.1. Suppose A holds. Then there exists a constant $R$ such that $R(X^n) \to R$ a.s. and in $L_1$.

Proof. The process $X$ is ergodic, so for a constant $R$,

$$\frac{1}{n}\sum_{t=1}^{n}\min_{s \in S} R_t(s|X_{-\infty}^{\infty}) \to R, \quad \text{a.s. and in } L_1. \tag{21}$$

Let $M < \infty$ be such that $P(C'_n < M) =: q > 0$. Let, for every $n$, $k(n) = \max\{k \le n : C'_k < M\}$. Since the process $\{C'_k\}$ is ergodic, $k(n) \to \infty$ a.s. as $n \to \infty$. From (20), it follows that with $A := \max_{a,s} l(a,s)$,

$$|\min_s R_t(s|X_{-\infty}^{\infty}) - \min_s R_t(s|X^n)| \le A\|P(Y_t \in \cdot\,|X^n) - P(Y_t \in \cdot\,|X_{-\infty}^{\infty})\|.$$

Hence

$$\Big|R(X^n) - \frac{1}{n}\sum_{t=1}^{n}\min_{s \in S} R_t(s|X_{-\infty}^{\infty})\Big| \le \frac{A}{n}\sum_{t=1}^{n}\|P(Y_t \in \cdot\,|X^n) - P(Y_t \in \cdot\,|X_{-\infty}^{\infty})\| \le \frac{A}{n}\sum_{t=1}^{k(n)}\|P(Y_t \in \cdot\,|X^n) - P(Y_t \in \cdot\,|X_{-\infty}^{\infty})\| + \frac{A}{n}2(n - k(n)). \tag{22}$$

By inequality (17), summing over $1 \le t \le k(n)$,

$$\sum_{t=1}^{k(n)}\|P(Y_t \in \cdot\,|X^n) - P(Y_t \in \cdot\,|X_{-\infty}^{\infty})\| \le C_1\sum_{t=1}^{k(n)}\rho_o^{t-1} + M\sum_{t=1}^{k(n)}\rho_o^{k(n)-t} \le (C_1 + M)\sum_{m=0}^{\infty}\rho_o^{m} < \infty, \quad \text{a.s.}$$

Let $\tau_1 := \min\{i > 0 : C'_i < M\}$ and $\tau_j := \min\{i > \tau_{j-1} : C'_i < M\}$. Since $\{C'_k\}$ is ergodic, the random variables $T_j := \tau_{j+1} - \tau_j$, $j = 1, 2, \ldots$, are identically distributed. By Kac's return time theorem, $ET_j = q^{-1}$. Finally, denote

$$j(n) = \max\{j : \tau_j \le n\}.$$

Thus $k(n) = \tau_{j(n)}$ and $n - k(n) \le T_{j(n)}$. Since $T_j$ is a.s. finite, clearly $j(n), k(n) \to \infty$ as $n$ grows. Since $ET_j$ is finite, it follows that

$$\frac{T_j}{j} \to 0, \quad \text{a.s.},$$

implying that

$$\frac{n - k(n)}{n} \le \frac{T_{j(n)}}{j(n)} \to 0, \quad \text{a.s.} \tag{23}$$

Hence, the right-hand side of (22) goes to 0 a.s., and from (21) it now follows that $R(X^n) \to R$ a.s. Risks are nonnegative, so the convergence in $L_1$ follows from Scheffé's lemma. □
Given $l$, the constant $R$ (the asymptotic risk) depends on the model only. It measures the average loss of classifying one symbol using the optimal classifier. For example, if $l$ is symmetric, then the optimal classifier (in the sense of misclassification error) makes on average about $Rn$ classification errors. Clearly this is a lower bound: no other classifier does better. The constant $R$ might be hard to determine theoretically, but Theorem 3.1 guarantees that it can be approximated by simulations.
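A minimal simulation sketch of such an approximation is given below. Everything model-specific is an assumption made for illustration: a two-state chain with strictly positive binary emission probabilities, the 0-1 loss, a single simulated path of length 2000, and a fixed random seed. The smoothing probabilities are computed with a normalized forward-backward pass, and $R(x^n)$ is then the average of one minus the largest smoothing probability.

```python
import random


def simulate(P, emission, init, n, rng):
    """Draw a hidden path y and observations x of length n from the HMM."""
    K = len(P)
    y = [rng.choices(range(K), weights=init)[0]]
    for _ in range(n - 1):
        y.append(rng.choices(range(K), weights=P[y[-1]])[0])
    x = [rng.choices(range(len(emission[0])), weights=emission[s])[0] for s in y]
    return y, x


def smoothing_probabilities(P, emission, init, x):
    """Normalized forward-backward pass; returns post[t][s] = P(Y_t = s | x^n)."""
    K, n = len(P), len(x)
    alpha, prev = [], None
    for t in range(n):
        if t == 0:
            a = [init[s] * emission[s][x[0]] for s in range(K)]
        else:
            a = [emission[s][x[t]] * sum(prev[i] * P[i][s] for i in range(K))
                 for s in range(K)]
        c = sum(a)
        prev = [v / c for v in a]  # rescale to avoid underflow
        alpha.append(prev)
    beta = [[1.0] * K for _ in range(n)]
    for t in range(n - 2, -1, -1):
        b = [sum(P[s][i] * emission[i][x[t + 1]] * beta[t + 1][i] for i in range(K))
             for s in range(K)]
        c = sum(b)
        beta[t] = [v / c for v in b]
    post = []
    for t in range(n):
        w = [alpha[t][s] * beta[t][s] for s in range(K)]
        c = sum(w)
        post.append([v / c for v in w])
    return post


def optimal_risk_01(post):
    """R(x^n) under 0-1 loss: average of 1 - max_a P(Y_t = a | x^n)."""
    return sum(1.0 - max(p) for p in post) / len(post)


# Hypothetical model; init is the stationary law of P.
P = [[0.9, 0.1], [0.2, 0.8]]
init = [2 / 3, 1 / 3]
emission = [[0.7, 0.3], [0.1, 0.9]]

rng = random.Random(0)
y, x = simulate(P, emission, init, 2000, rng)
post = smoothing_probabilities(P, emission, init, x)
estimate = optimal_risk_01(post)
```

Repeating the computation for increasing `n`, or averaging over several independent paths, gives a rough idea of how quickly $R(X^n)$ stabilizes for a particular model; the theorem itself says nothing about the rate.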